Training production parameters of context-dependent phones for speech recognition
Abstract
This paper describes a representation of the acoustic information in a trained phone library at both the production-parameter level and the spectral level. The phones are trained in the parametric domain and transformed to the spectral domain by a synthesis procedure. This twofold description makes it possible to explore potentially more powerful procedures for speaker adaptation and for generating unseen triphones, while the more robust spectral representation is used for recognition. Context-dependent phones are represented by control parameters to a cascade formant synthesiser. During training, the parameters are extracted using an analysis-by-synthesis technique, and the trajectories are approximated by piecewise-linear segments. For recognition, the parameter tracks are transformed to a sequence of spectral subphone states, similar to a hidden Markov model. Recognition is performed by Viterbi search in a finite-state network. Recognition experiments were performed on Swedish connected-digit strings pronounced by seven male speakers. In one experiment, unseen triphones were created by concatenating monophones and diphones and interpolating the parameter trajectories between line endpoints. In another, speaker adaptation was based on generalisation of differences of observed triphones from the phone library. With optimum weighting of duration information, the results for cross-speaker recognition, speaker adaptation, and multi-speaker training were 98.5%, 98.9%, and 99.1% correct digit recognition, respectively. Preliminary experiments with created unseen triphones show no improvement. In informal listening tests of digit strings resynthesised by concatenating trained triphones, the speech was judged intelligible but far from natural.
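The step from piecewise-linear parameter trajectories to a sequence of subphone states can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 80 ms phone duration, the four-state split, and the second-formant (F2) breakpoint values are all assumptions made for illustration. The sketch only samples one parameter track at the centre of each state; the synthesis step that would turn those parameter values into spectra is not shown.

```python
from bisect import bisect_right


def sample_track(breakpoints, times):
    """Evaluate a piecewise-linear track at the given times.

    breakpoints: list of (time_ms, value) pairs sorted by time.
    times: query times in ms; queries outside the track are clamped
    to the first or last breakpoint value.
    """
    xs = [t for t, _ in breakpoints]
    ys = [v for _, v in breakpoints]
    out = []
    for t in times:
        if t <= xs[0]:
            out.append(ys[0])
        elif t >= xs[-1]:
            out.append(ys[-1])
        else:
            i = bisect_right(xs, t)  # segment with xs[i-1] <= t < xs[i]
            frac = (t - xs[i - 1]) / (xs[i] - xs[i - 1])
            out.append(ys[i - 1] + frac * (ys[i] - ys[i - 1]))
    return out


def state_centres(duration_ms, n_states):
    """Return the centre time of each of n_states equal-length states."""
    step = duration_ms / n_states
    return [(k + 0.5) * step for k in range(n_states)]


# Hypothetical F2 track in Hz for one 80 ms phone (illustrative values only).
f2_track = [(0.0, 1200.0), (40.0, 1800.0), (80.0, 1600.0)]
centres = state_centres(80.0, 4)
f2_states = sample_track(f2_track, centres)
print(f2_states)  # [1350.0, 1650.0, 1750.0, 1650.0]
```

Storing only the breakpoints keeps each phone compact, and the same interpolation routine could in principle be reused when joining the endpoints of concatenated units, although how the paper handles that case is not specified in the abstract.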
Related resources
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Discriminative models for speech recognition
The discriminative approach to speech recognition offers several advantages over the generative, such as a simple introduction of additional dependencies and direct modelling of sentence posterior probabilities/decision boundaries. However, the number of sentences that can possibly be encoded into an observation sequence can be vast, which makes the application of models, such as support vector...
Minimum confusibility training of context dependent demiphones
In recent years, two different approaches have been widely used to improve acoustic modeling in continuous speech recognition systems: discriminative training algorithms and context dependent subword units. However, while the use of each of these techniques leads to much better results than standard maximum likelihood trained phone models, their combination, i.e. discriminative...
Context dependent hybrid HMM/ANN systems for large vocabulary continuous speech recognition system
In this paper, hybrid HMM/ANN systems are used to model context dependent phones. In order to reduce the number of parameters as well as to better catch the dynamics of the phonetic segments, we combine (context dependent) diphone models with context independent phone models. Transitions from phone to phone are modeled as generalized context dependent distributions while phonetic units are cont...